Abstract

We prove that the class of functions $g : \{-1,+1\}^n \to \{-1,+1\}$ that only depend on an
unknown subset of $k \ll n$ variables (so-called $k$-juntas) is agnostically learnable from a random
walk in time polynomial in $n$, $2^{k^2}$, $\epsilon^{-k}$, and $\log(1/\delta)$. In other words, there is an algorithm
with the claimed running time that, given $\epsilon, \delta > 0$ and access to a random walk on $\{-1,+1\}^n$
labeled by an arbitrary function $f : \{-1,+1\}^n \to \{-1,+1\}$, finds with probability at least $1-\delta$
a $k$-junta that is $(\mathrm{opt}(f) + \epsilon)$-close to $f$, where $\mathrm{opt}(f)$ denotes the distance of a closest $k$-junta
to $f$.

Keywords: agnostic learning, random walks, juntas 

1 Introduction 

1.1 Motivation 

In supervised learning, the learner is provided with a training set of labeled examples 

$$(x^1, f(x^1)), (x^2, f(x^2)), \ldots,$$

and the goal is to find a hypothesis $h$ that is a good approximation to $f$, i.e., that gives good estimates
for $f(x)$ also on the points that are not present in the training set. In many applications, the points
$x$ correspond to particular states of a system and the labels $f(x)$ correspond to classifications of
these states. If the underlying system evolves over time and thus $(x^t, f(x^t))$ corresponds to a
measurement of the current state and its classification at time $t$, it is often reasonable to assume
that state changes only occur locally, i.e., at each time $t$, $x^t$ differs only "locally" from $x^{t-1}$. Such
phenomena occur for instance in physics or biology: e.g., in a fixed time interval, a particle can 
only travel a finite distance and the mutation of a DNA sequence can be assumed to happen in a 
single position at a time. In discrete settings, such processes are often modeled as random walks 
on graphs, in which the nodes represent the states of the system, and edges indicate possible local 
state changes. 

We are interested in studying the special case that the underlying graph is a hypercube, i.e., the
node set is $\{-1,1\}^n$ and two nodes are adjacent if and only if they differ in exactly one coordinate.
Furthermore, we restrict the setting to Boolean classifications. This random walk learning model
has attracted a lot of attention since the nineties [1, 3, 7, 6, 15], mainly because of its interesting
learning theoretic properties. The model is weaker than the membership query model in which the
learner is allowed to ask the classifications of specific points, and it is stronger than the uniform-
distribution model in which the learner observes points that are drawn independently of each other
from the uniform distribution on $\{-1,1\}^n$. Moreover, the latter relation is known to be strict: under
a standard complexity theoretic assumption (existence of one-way functions) there is a class that
is efficiently learnable from labeled random walks, but not from independent uniformly distributed
examples [6, Proposition 2].

The random walk learning model shares some similarities with both other models mentioned 
above: as in the uniform-distribution model, the examples are generated at random (so that the 
learner has no influence on the given examples) and points of the random walk that correspond to 
time points that are sufficiently far apart roughly behave like independent uniformly distributed 
points. On the other hand, some learning problems that appear to be infeasible in the uniform 
distribution model but are known to be easy to solve in the membership query model have turned 
out to be easy in the random walk model as well. Among them are the problem of learning DNFs
with polynomially many terms [6] (even under random classification noise) and the problem of
learning parity functions in the presence of random classification noise. The former result relies
on an efficient algorithm performing the Bounded Sieve [6] introduced in [5]. The latter result
follows from the fact that the (noise-less) random walk model admits an efficient approximation of
variable influences, and the effect of random classification noise can be easily dealt with by drawing
a sufficiently large number of examples.

Given this success of the random walk model in learning large classes in the presence of random 
classification noise, it is natural to ask whether it can also cope with even more severe noise models. 
One elegant, albeit challenging, noise model is the agnostic learning model introduced by Kearns et 
al. [11]. In this model, no assumption whatsoever is made about the labels. Instead of asking for 
a hypothesis that is close to the classification function, the goal in agnostic learning is to produce 
a hypothesis that agrees with the labels on nearly as many points as the best fitting function 
from the target class. More formally, given a class $\mathcal{C}$ of Boolean functions on $\{-1,1\}^n$ and an
arbitrary function $f : \{-1,1\}^n \to \{-1,1\}$, let $\mathrm{opt}_{\mathcal{C}}(f) = \min_{g \in \mathcal{C}} \Pr[g(x) \neq f(x)]$. The class $\mathcal{C}$ is
agnostically learnable if there is an algorithm that, for any $\epsilon, \delta > 0$, produces a hypothesis $h$ that,
with probability at least $1-\delta$, satisfies $\Pr[h(x) \neq f(x)] \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon$.

Recently, Gopalan et al. [9] have shown that the class of Boolean functions that can be repre- 
sented by decision trees of polynomial size (in the number of variables) can be learned agnostically 
from membership queries in polynomial time. Their main result combines the Kushilevitz-Mansour 
algorithm for finding large Fourier coefficients [12] with a gradient-descent algorithm [16] to solve an
$\ell_1$-regression problem for sparse polynomials. They also present a simpler algorithm (with slightly
worse running time) that properly agnostically learns the class of $k$-juntas. These are functions
$f : \{-1,1\}^n \to \{-1,1\}$ that depend on an a priori unknown subset of at most $k$ variables. The
term proper learning refers to the requirement that only hypotheses from the target class (here:
$k$-juntas) are produced.

The investigation of the learnability of this class has both practical and theoretical motivation. 
Practically, the junta learning problem serves as a clean model of learning in the presence of 
irrelevant information, a core problem in data mining [4]. From a theoretical perspective, the 
problem is interesting due to its close relationship to learning DNF formulas, decision trees, and 
noisy parity functions [14]. 
1.2 Our Results and Techniques



The main result of this paper is that the class of $k$-juntas on $n$ variables is properly agnostically
learnable in the random walk model in time polynomial in $n$ (times some function in $k$ and the
accuracy parameter $\epsilon$). More precisely, we show

Theorem 1. Let $\mathcal{C}$ be the class of $k$-juntas on $n$ variables. There is an algorithm that, given
$\epsilon, \delta > 0$ and access to a random walk $x^1, x^2, \ldots$ on $\{-1,1\}^n$ that is labeled by an arbitrary function
$f : \{-1,1\}^n \to \{-1,1\}$, returns a $k$-junta $h$ that, with probability at least $1-\delta$, satisfies

$$\Pr[h(x) \neq f(x)] \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon.$$

The running time of this algorithm is polynomial in $n$, $2^{k^2}$, $(1/\epsilon)^k$, and $\log(1/\delta)$.

We thus prove the first efficient learning result for agnostically learning juntas (even properly) 
in a passive learning model. 

Our main technical lemma (Lemma 3) shows that for an arbitrary function $f$ and a $k$-junta
$g$, there exists another $k$-junta $g'$ that is almost as correlated with $f$ as $g$ is and whose relevant
variables can be inferred from all low-level Fourier coefficients of $f$ of a certain size. These Fourier
coefficients can in turn be detected using the Bounded Sieve algorithm of Bshouty et al. [6] given
a random walk labeled by $f$. Once a superset $R$ of the relevant variables of $g'$ is found, it is easy
to derive a hypothesis that only depends on at most $k$ variables from $R$ and that best matches the
given labels: For each $k$-element subset $J \subseteq R$, the best matching function with relevant variables
in J is obtained by taking majority votes on points that coincide in these coordinates. Similarly to 
the classical result of Angluin and Laird [2] that a (proper) hypothesis that minimizes the number 
of disagreements with the labels is close to the target function (in the PAC learning model with 
random classification noise), we show that such a hypothesis is also a good candidate to satisfy 
the agnostic learning goal in the random walk model (see Proposition 1). A similar statement has 
implicitly been shown in the agnostic PAC learning model (see the proof of Theorem 1 in [11]). 

1.3 Related Work 

Our algorithm for agnostically learning juntas in the random walk model has some similarities 
to Gopalan et al.'s recent algorithm for properly agnostically learning juntas in the membership 
query model [9]. The approaches differ in two respects: first, we
do not explicitly calculate the quantities $I_i^{\le k} = \sum_{S : i \in S, |S| \le k} \hat{f}(S)^2$ but instead use our technical
lemma mentioned above, which may be of independent interest. Second, instead of using their
characterization of the best fitting junta with a fixed set of relevant variables in terms of the
Fourier spectrum of $f$ ([9, Lemma 13]), we directly construct such a best fitting hypothesis by
taking majority votes in ambiguous situations.

Even though we became aware of Gopalan et al.'s result only after devising our junta learning
algorithm, we have decided to adopt much of their notation for the benefit of the readers.

It should also be noted that a generalization of Gopalan et al.'s decision tree learning algorithm 
cannot be adapted for the random walk model in a straightforward manner: The running time 
of the only known analogue of the Kushilevitz-Mansour subroutine for the random walk model 
(i.e., the Bounded Sieve) is exponential in the level up to which the large Fourier coefficients are 
sought. In general, however, sparse polynomials can be concentrated on high levels. It would be 
interesting to see if the results in [9] can also be derived for the restriction of the class of all $t$-sparse
polynomials to $t$-sparse polynomials of degree roughly $\log(t)$ since for every decision tree of size $t$,
there is an $\epsilon$-close decision tree of depth $O(\log(t/\epsilon))$ (cf. [5]). In this case, the same result should
hold for the random walk model.
1.4 Organization of This Paper

We briefly introduce notational and technical prerequisites in Section 2. The random walk learning 
model and its agnostic variant are introduced in Section 3. Section 4 contains a concentration bound for
random walks and the result on disagreement minimization in the random walk model. The main
result on agnostically learning juntas is presented in Section 5. The Appendix contains a formal 
statement and proof of a result concerning the independence of points in a random walk (Section A) 
and an elementary proof of the concentration bound (Section B). 

2 Preliminaries 

Let $\mathbb{N} = \{0,1,2,\ldots\}$. For $n \in \mathbb{N}$, let $[n] = \{1,\ldots,n\}$. For $x, x' \in \{-1,1\}^n$, let $x \odot x'$ denote
the vector obtained by coordinate-wise multiplication of $x$ and $x'$. For $i \in [n]$, let $e_i$ denote the
vector in which all entries are equal to $+1$ except in the $i$th position, where the entry is $-1$.
For $f : \{-1,1\}^n \to \{-1,1\}$, a variable $x_i$ is said to be relevant to $f$ (and $f$ depends on $x_i$) if
there is an $x \in \{-1,1\}^n$ such that $f(x \odot e_i) \neq f(x)$. For $i \in [n]$ and $a \in \{-1,1\}$, denote by
$f_{x_i=a} : \{-1,1\}^n \to \{-1,1\}$ the sub-function of $f$ obtained by letting $f_{x_i=a}(x) = f(x')$ with $x'_j = x_j$
if $j \neq i$ and $x'_i = a$. Thus, $x_i$ is relevant to $f$ if and only if $f_{x_i=1} \neq f_{x_i=-1}$. The restriction of a vector
$x \in \{-1,1\}^n$ to a subset of coordinates $J \subseteq [n]$ is denoted by $x|_J \in \{-1,1\}^{|J|}$. All probabilities
and expectations in this paper are taken with respect to the uniform distribution (except when
indicated differently).

For $f, g : \{-1,1\}^n \to \mathbb{R}$, define the inner product

$$\langle f, g \rangle = \mathrm{E}_x[f(x)g(x)] = 2^{-n} \sum_{x \in \{-1,1\}^n} f(x)g(x).$$

It is well-known that the functions $\chi_S : \{-1,1\}^n \to \{-1,1\}$, $S \subseteq [n]$, defined by $\chi_S(x) = \prod_{i \in S} x_i$
form an orthonormal basis of the space of real-valued functions on $\{-1,1\}^n$. Thus, every function
$f : \{-1,1\}^n \to \mathbb{R}$ has the unique Fourier expansion

$$f = \sum_{S \subseteq [n]} \hat{f}(S)\chi_S,$$

where $\hat{f}(S) = \langle f, \chi_S \rangle$ are the Fourier coefficients of $f$. Let $\|f\|_2 = \langle f, f \rangle^{1/2} = \mathrm{E}[f(x)^2]^{1/2}$.
Plancherel's equation states that

$$\langle f, g \rangle = \sum_{S \subseteq [n]} \hat{f}(S)\hat{g}(S), \qquad (1)$$

and from this, Parseval's equation $\|f\|_2^2 = \sum_{S \subseteq [n]} \hat{f}(S)^2$ follows as the special case $f = g$.
For $f, g : \{-1,1\}^n \to \{-1,1\}$, define the distance between $f$ and $g$ by

$$\Delta(f, g) = \Pr[f(x) \neq g(x)],$$

and for a class $\mathcal{C} = \mathcal{C}_n$ of functions from $\{-1,1\}^n$ to $\{-1,1\}$, let $\mathrm{opt}_{\mathcal{C}}(f) = \min_{g \in \mathcal{C}} \Delta(f, g)$ be the
distance of $f$ to a nearest function in $\mathcal{C}$. It is easily seen that $\Delta(f, g) = (1 - \langle f, g \rangle)/2$. Furthermore,
for a sample $\mathcal{S} = (x^i, y^i)_{i=1,\ldots,m}$ with $x^i \in \{-1,1\}^n$ and $y^i \in \{-1,1\}$, let

$$\Delta(f, \mathcal{S}) = \frac{1}{m} \left|\{i \in \{1,\ldots,m\} \mid f(x^i) \neq y^i\}\right|$$

be the fraction of examples in $\mathcal{S}$ for which the labels disagree with the labeling function $f$.
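
The definitions of this section can be checked numerically on a small cube. The following is a minimal brute-force sketch, not part of the paper's algorithms: the helper names (`cube`, `chi`, `inner`, `fourier`, `distance`) are hypothetical, coordinates are 0-based, and the enumeration of all $2^n$ points is only feasible for very small $n$.

```python
from itertools import combinations, product
from math import isclose, prod

def cube(n):
    """All points of {-1,1}^n as tuples."""
    return list(product((-1, 1), repeat=n))

def subsets(n):
    """All subsets S of {0,...,n-1} (0-based coordinates) as tuples."""
    return [S for r in range(n + 1) for S in combinations(range(n), r)]

def chi(S, x):
    """Character chi_S(x) = prod_{i in S} x_i (the empty product is 1)."""
    return prod(x[i] for i in S)

def inner(f, g, n):
    """<f, g> = E_x[f(x) g(x)] under the uniform distribution."""
    return sum(f(x) * g(x) for x in cube(n)) / 2 ** n

def fourier(f, n):
    """Dictionary mapping S to the Fourier coefficient <f, chi_S>."""
    return {S: inner(f, lambda x, S=S: chi(S, x), n) for S in subsets(n)}

def distance(f, g, n):
    """Delta(f, g) = Pr_x[f(x) != g(x)]."""
    return sum(f(x) != g(x) for x in cube(n)) / 2 ** n

n = 3
f = lambda x: 1 if sum(x) > 0 else -1   # majority of three bits
g = lambda x: x[0]                      # a 1-junta
fh, gh = fourier(f, n), fourier(g, n)
plancherel = sum(fh[S] * gh[S] for S in fh)   # right-hand side of (1)
parseval = sum(c * c for c in fh.values())    # squared 2-norm of f
```

Here `plancherel` agrees with `inner(f, g, n)`, `parseval` equals 1 because $f$ is Boolean, and `distance(f, g, n)` matches $(1 - \langle f, g \rangle)/2$.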
3 The Random Walk Learning Model



3.1 Learning from Noiseless Examples 

Let $\mathcal{C} = \bigcup_{n \in \mathbb{N}} \mathcal{C}_n$ be a class of functions, where each $\mathcal{C}_n$ contains functions $f : \{-1,1\}^n \to \{-1,1\}$.
In the random walk learning model, a learning algorithm has access to the oracle RW($f$) for some
unknown function $f \in \mathcal{C}_n$. On the first request, RW($f$) generates a point $x \in \{-1,1\}^n$ according to
the uniform distribution on $\{-1,1\}^n$ and returns the example $(x, f(x))$, where we refer to $f(x)$ as
the label or the classification of the example. On subsequent requests, it selects a random coordinate
$i \in [n]$ and returns $(x \odot e_i, f(x \odot e_i))$, where $x$ is the point returned in the last query. The goal of a
learning algorithm $A$ is, given inputs $\delta, \epsilon > 0$, to output a hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ such
that with probability at least $1-\delta$ (taken over all possible random walks of the requested length),
$\Pr[h(x) \neq f(x)] \le \epsilon$. In this case, $A$ is said to learn $f$ with accuracy $\epsilon$ and confidence $1-\delta$.
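
To make the oracle concrete, here is a small simulation sketch. The class name `RandomWalkOracle` and its interface are hypothetical, coordinates are 0-based, and the coordinate to flip is assumed to be chosen uniformly at random.

```python
import random

class RandomWalkOracle:
    """Hypothetical simulator of the oracle RW(f) on {-1,1}^n."""

    def __init__(self, f, n, seed=None):
        self.f, self.n = f, n
        self.rng = random.Random(seed)
        self.x = None  # no point has been returned yet

    def example(self):
        """Return the next labeled example (x, f(x)) of the walk."""
        if self.x is None:
            # First request: a uniformly distributed starting point.
            self.x = [self.rng.choice((-1, 1)) for _ in range(self.n)]
        else:
            # Later requests: flip one coordinate chosen uniformly at random.
            i = self.rng.randrange(self.n)
            self.x[i] = -self.x[i]
        point = tuple(self.x)
        return point, self.f(point)

oracle = RandomWalkOracle(lambda x: x[0] * x[2], n=5, seed=0)
walk = [oracle.example() for _ in range(20)]
```

Consecutive points of `walk` differ in exactly one coordinate, which is the locality property the model is built on.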

The class $\mathcal{C}$ is learnable from random walks if there is an algorithm $A$ that for every $n$, every
$f \in \mathcal{C}_n$, every $\delta > 0$, and every $\epsilon > 0$ learns $f$ with access to RW($f$) with accuracy $\epsilon$ and confidence
$1-\delta$. The class $\mathcal{C}$ is said to be learnable in time equal to the running time of $A$, which is a function
of $n$, $\epsilon$, $\delta$, and possibly other parameters involved in the parameterization of the class $\mathcal{C}$.

If a learning algorithm only outputs hypotheses $h \in \mathcal{C}_n$, it is called a proper learning algorithm.
In this case, $\mathcal{C}$ is properly learnable.

The random walk model is a passive learning model in the sense that a learning algorithm has 
no direct control on which examples it receives (as opposed to the membership query model in which 
the learner is allowed to ask for the labels of specific points x). For passive learning models, we 
may assume without loss of generality that all examples are requested at once. 

3.2 Agnostic Learning 

In the model of agnostic learning from random walks, we make no assumption whatsoever on the
nature of the labels. Following the model of Gopalan et al. [9], we assume that there is an arbitrary
function $f : \{-1,1\}^n \to \{-1,1\}$ according to which the examples are labeled, i.e., a learner observes pairs
$(x, f(x))$, with the points coming from a random walk. In other words, the learner has access to
RW($f$), but now $f$ is no longer required to belong to $\mathcal{C}$. We can think of the labels as originating
from a concept $g \in \mathcal{C}$, with an $\mathrm{opt}_{\mathcal{C}}(f)$ fraction of labels flipped by an adversary.

The goal of a learning algorithm is to output a hypothesis $h$ that performs nearly as well as the
best function of $\mathcal{C}$. Let $\mathrm{opt}_{\mathcal{C}}(f) = \min_{g \in \mathcal{C}} \Pr_x[g(x) \neq f(x)]$, where $x \in \{-1,1\}^n$ is drawn according
to the uniform distribution. An algorithm agnostically learns $\mathcal{C}$ if, for any $f : \{-1,1\}^n \to \{-1,1\}$,
given $\delta, \epsilon > 0$, it outputs a hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ such that with probability at least $1-\delta$,
$\Pr_x[h(x) \neq f(x)] \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon$. Again, if the algorithm always outputs a hypothesis $h \in \mathcal{C}$, then
it is called a proper learning algorithm, and $\mathcal{C}$ is said to be properly agnostically learnable.

Although all learning algorithms in this paper are proper, we believe that a word is in order
concerning the formulation of the learning goal in improper agnostic learning. Namely, it could
well happen that we can find a hypothesis that satisfies $\Pr_x[h(x) \neq f(x)] \le \mathrm{opt}_{\mathcal{C}}(f)$, but such an $h$
could be as far as $2\,\mathrm{opt}_{\mathcal{C}}(f)$ from all concepts in $\mathcal{C}$, which can definitely not be considered a sensible
solution if, say, $\mathrm{opt}_{\mathcal{C}}(f) > 1/4$. Instead, a hypothesis should rather be required to be $\epsilon$-close to
some function $g \in \mathcal{C}$ that performs best (or almost best): $\Pr[h(x) \neq g(x)] \le \epsilon$ for some $g \in \mathcal{C}$ with
$\Pr[g(x) \neq f(x)] = \mathrm{opt}_{\mathcal{C}}(f)$ (or for some near-optimal $g \in \mathcal{C}$ with $\Pr[g(x) \neq f(x)] \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon'$).
Alternatively, one can require $h$ to belong to some reasonably chosen hypothesis class $\mathcal{H} \supseteq \mathcal{C}$,
e.g., the hypotheses output by the algorithm in [9] for learning decision trees of size $t$ are $t$-sparse
polynomials. In fact, that algorithm properly agnostically learns the latter class.
4 A Concentration Bound for Labeled Random Walks



The following lemma estimates the probability that, after drawing a random walk $x^0, \ldots, x^\ell$, the
points $x^0$ and $x^\ell$ are independent. The proof (and a more formal statement) are deferred to the
Appendix (see Lemma 4 in Section A).

Lemma 1. Let $\delta > 0$, $\ell \ge n \ln(n/\delta)$ and $x^0, \ldots, x^\ell$ be a random walk on $\{-1,1\}^n$. Then, with
probability at least $1-\delta$, $x^0$ and $x^\ell$ are independent$^1$ and uniformly distributed.

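Lemma 2. Let $g : \{-1,1\}^n \to [-1,1]$ and $\delta, \epsilon > 0$. Let $N = \lceil n \ln(n/\delta) \rceil$,

$$m \ge \frac{2N}{\epsilon^2} \ln\left(\frac{2N}{\delta}\right),$$

and $x^1, \ldots, x^m$ be a random walk on $\{-1,1\}^n$. Then, with probability at least $1-\delta$,

$$\left| \frac{1}{m} \sum_{i=1}^{m} g(x^i) - \mathrm{E}_x[g(x)] \right| \le \epsilon,$$

where the expectation is taken over a uniformly distributed $x$.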

Although a similar result can be obtained from the more general works on concentration bounds 
for random walks by Gillman [8] and for finite Markov Chains by Lezaud [13], we give an elementary 
proof for Lemma 2 in the Appendix (see Section B). 

As an immediate consequence, the fraction of disagreements between the labels $f(x^i)$ and the
values $h(x^i)$ on a random walk converges quickly to the total fraction of disagreements on all of
$\{-1,1\}^n$:

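Corollary 1. Let $\mathcal{C} = \mathcal{C}_n$ be a class of functions from $\{-1,1\}^n$ to $\{-1,1\}$. Let $\epsilon, \delta > 0$,
$f : \{-1,1\}^n \to \{-1,1\}$, and $\mathcal{S} = (x^i, f(x^i))_{i=1,\ldots,m}$ be a labeled random walk of length

$$m \ge \frac{2N}{\epsilon^2} \ln\left(\frac{2N|\mathcal{C}|}{\delta}\right),$$

where $N = \lceil n \ln(n|\mathcal{C}|/\delta) \rceil$. Then, with probability at least $1-\delta$, for every $h \in \mathcal{C}$,

$$|\Delta(h, \mathcal{S}) - \Delta(h, f)| \le \epsilon. \qquad (2)$$

Proof. Let $h \in \mathcal{C}$. Taking $g(x) = \frac{1}{2}|h(x) - f(x)|$, we obtain $\Delta(h, \mathcal{S}) = \frac{1}{m}\sum_{i=1}^{m} g(x^i)$ and $\Delta(h, f) = \mathrm{E}_x[g(x)]$,
so that by Lemma 2 (applied with $\delta/|\mathcal{C}|$ in place of $\delta$), $|\Delta(h, \mathcal{S}) - \Delta(h, f)| \le \epsilon$ with probability at least
$1 - \delta/|\mathcal{C}|$. Thus, by the union bound over all $h \in \mathcal{C}$, with probability at least $1-\delta$, (2) holds for all $h \in \mathcal{C}$. □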

The following proposition shows that, similarly to the classical result by Angluin and Laird [2]
for distribution-free PAC-learning and the analogue by Kearns et al. [11] for agnostic PAC-learning,
also in the random walk model agnostic learning is achieved by finding a hypothesis that minimizes
the number of disagreements with a labeled random walk of sufficient length.



$^1$ More precisely, we can perform an additional experiment such that conditional on some event that occurs with
probability at least $1-\delta$ (taken over the draw of the random walk and the outcome of the additional experiment),
$x^0$ and $x^\ell$ are independent. For more details, see Section A in the Appendix.

Proposition 1. Let $\mathcal{C} = \mathcal{C}_n$ be a class of functions from $\{-1,1\}^n$ to $\{-1,1\}$. Let $\epsilon, \delta > 0$,
$f : \{-1,1\}^n \to \{-1,1\}$, and $\mathcal{S} = (x^i, f(x^i))_{i=1,\ldots,m}$ be a labeled random walk of length
$m \ge (8N/\epsilon^2) \ln(2N|\mathcal{C}|/\delta)$, where $N = \lceil n \ln(2n|\mathcal{C}|/\delta) \rceil$. Let $h_{\mathrm{opt}} \in \mathcal{C}$ minimize $\Delta(h, \mathcal{S})$. Then, with
probability at least $1-\delta$, $\Delta(h_{\mathrm{opt}}, f) \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon$. In particular, the required sample size is
polynomial in $n$, $\log|\mathcal{C}|$, $1/\epsilon$, and $\log(1/\delta)$.

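Proof. By Corollary 1, with probability at least $1-\delta$, $|\Delta(h, \mathcal{S}) - \Delta(h, f)| \le \epsilon/2$ for all $h \in \mathcal{C}$. In particular, all functions $h \in \mathcal{C}$
with $\Delta(h, f) > \mathrm{opt}_{\mathcal{C}}(f) + \epsilon$ have $\Delta(h, \mathcal{S}) > \mathrm{opt}_{\mathcal{C}}(f) + \epsilon/2$, whereas all functions $h \in \mathcal{C}$ with
$\Delta(h, f) = \mathrm{opt}_{\mathcal{C}}(f)$ have $\Delta(h, \mathcal{S}) \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon/2$. Consequently, $\Delta(h_{\mathrm{opt}}, \mathcal{S}) \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon/2$, and
thus $\Delta(h_{\mathrm{opt}}, f) \le \mathrm{opt}_{\mathcal{C}}(f) + \epsilon$. □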
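
Disagreement minimization itself is a plain search over the class. The sketch below is illustrative only: the toy class and the helper names are hypothetical, and for determinism the sample lists every point of $\{-1,1\}^3$ once instead of being drawn from a random walk.

```python
from itertools import product

def empirical_distance(h, sample):
    """Fraction of examples (x, y) in the sample with h(x) != y."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def minimize_disagreements(hypotheses, sample):
    """Name of a hypothesis with the fewest disagreements on the sample."""
    return min(hypotheses, key=lambda name: empirical_distance(hypotheses[name], sample))

# Toy class: the two constants, the dictators x_i, and their negations.
n = 3
C = {"+1": lambda x: 1, "-1": lambda x: -1}
for i in range(n):
    C[f"x{i}"] = lambda x, i=i: x[i]
    C[f"-x{i}"] = lambda x, i=i: -x[i]

# The labels come from majority, which does not belong to the class (agnostic setting).
f = lambda x: 1 if sum(x) > 0 else -1
sample = [(x, f(x)) for x in product((-1, 1), repeat=n)]
best = minimize_disagreements(C, sample)
```

Every dictator disagrees with majority on a quarter of the cube, so the search returns one of them (the first in dictionary order).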



5 Agnostically Learning Juntas 

We start with our main technical lemma that shows that whenever there is a $k$-junta $g$ at distance
$\Delta(f, g)$ to some function $f$, then there is another $k$-junta $g'$ (in fact, a subfunction of $g$) at distance
at most $\Delta(f, g) + \epsilon$ such that the relevant variables of $g'$ can be detected by finding all low-level Fourier
coefficients that are of a certain minimum size.

Lemma 3. Let $f : \{-1,1\}^n \to \{-1,1\}$ be an arbitrary function and $g : \{-1,1\}^n \to \{-1,1\}$ be a
$k$-junta. Then, for every $\epsilon > 0$, there exists a $k$-junta $g'$ such that $\langle f, g' \rangle \ge \langle f, g \rangle - \epsilon$ and for all
relevant variables $x_i$ of $g'$, there exists $S \subseteq [n]$ with $|S| \le k$, $i \in S$, and

$$|\hat{f}(S)| \ge C \cdot 2^{-(k-1)/2} \cdot \epsilon, \qquad (3)$$

where $C = 1 - 1/\sqrt{2} \approx 0.293$.

Proof. The proof is by induction on $k$. For $k = 0$, there is nothing to show since there are no
relevant variables. For the induction step, let $k > 0$. Assume that taking $g'$ to be $g$ does not satisfy
the conclusion, i.e., for some relevant variable $x_i$ of $g$, $|\hat{f}(S)| < C \cdot 2^{-(k-1)/2} \cdot \epsilon$ for all $S \subseteq [n]$ with
$|S| \le k$ and $i \in S$. Our goal is to show that in this case, either $g_{x_i=1}$ or $g_{x_i=-1}$ is well correlated
with $f$ and thus ensures the existence of an appropriate $(k-1)$-junta $g'$.
Let $T = \{S \subseteq [n] \mid \hat{g}(S) \neq 0\}$. Then $|T| \le 2^k$. It follows that

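$$\begin{aligned}
\langle f, g \rangle &= \sum_{S \in T} \hat{f}(S)\hat{g}(S) = \sum_{S \in T : i \in S} \hat{f}(S)\hat{g}(S) + \sum_{S \in T : i \notin S} \hat{f}(S)\hat{g}(S) \\
&\le \sum_{S \in T : i \in S} |\hat{g}(S)| \cdot C \cdot 2^{-(k-1)/2} \cdot \epsilon + \sum_{S \in T : i \notin S} \hat{f}(S)\hat{g}(S) \\
&\le 2^{(k-1)/2} \cdot C \cdot 2^{-(k-1)/2} \cdot \epsilon + \sum_{S \in T : i \notin S} \hat{f}(S)\hat{g}(S) = C \cdot \epsilon + \sum_{S \in T : i \notin S} \hat{f}(S)\hat{g}(S),
\end{aligned}$$

where the first equation is Plancherel's equation (1) and the second inequality follows by Cauchy-
Schwarz (note that $\hat{g}$ is supported on at most $2^{k-1}$ sets $S$ with $i \in S$). Consequently,

$$\sum_{S \in T : i \notin S} \hat{f}(S)\hat{g}(S) \ge \langle f, g \rangle - C \cdot \epsilon.$$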



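Since for $S \subseteq [n]$,

$$\left(\widehat{g_{x_i=1}}(S) + \widehat{g_{x_i=-1}}(S)\right)/2 = \begin{cases} 0 & \text{if } i \in S \\ \hat{g}(S) & \text{if } i \notin S, \end{cases}$$
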

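it follows that

$$\langle f, g_{x_i=a} \rangle \ge \langle f, g \rangle - C \cdot \epsilon$$

for $a = 1$ or for $a = -1$. Now $g_{x_i=a}$ is a $(k-1)$-junta, so by induction hypothesis (applied with $\epsilon/\sqrt{2}$
in place of $\epsilon$), there exists some $(k-1)$-junta $g'$ such that

$$\langle f, g' \rangle \ge \langle f, g_{x_i=a} \rangle - \epsilon/\sqrt{2} \ge \langle f, g \rangle - C \cdot \epsilon - \epsilon/\sqrt{2} = \langle f, g \rangle - \epsilon$$

and for all $x_j$ relevant to $g'$, there exists $S \subseteq [n]$ with $|S| \le k-1$, $j \in S$, and

$$|\hat{f}(S)| \ge C \cdot 2^{-(k-2)/2} \cdot \epsilon/\sqrt{2} = C \cdot 2^{-(k-1)/2} \cdot \epsilon.$$

□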
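
The induction in the proof can be read as a procedure: while some relevant variable of the current junta has no large coefficient through it, replace the junta by its better-correlated restriction and shrink $\epsilon$ by a factor $\sqrt{2}$. The following brute-force sketch follows that reading. It is an illustration under assumptions, not an algorithm from the paper: all function names are hypothetical, coordinates are 0-based, the junta size $k$ is taken to be the number of relevant variables of the current function, and everything is computed by enumerating the whole cube.

```python
from itertools import combinations, product
from math import prod, sqrt

def corr(f, g, pts):
    """<f, g> under the uniform distribution on pts."""
    return sum(f(x) * g(x) for x in pts) / len(pts)

def flip(x, i):
    """The point x with coordinate i negated."""
    return x[:i] + (-x[i],) + x[i + 1:]

def relevant(g, n, pts):
    """Indices of the relevant variables of g."""
    return [i for i in range(n) if any(g(x) != g(flip(x, i)) for x in pts)]

def has_large_coefficient(f, i, level, tau, n, pts):
    """Is there S with i in S, |S| <= level and |f^(S)| >= tau?"""
    others = [j for j in range(n) if j != i]
    for r in range(level):
        for rest in combinations(others, r):
            S = (i,) + rest
            if abs(corr(f, lambda x: prod(x[j] for j in S), pts)) >= tau:
                return True
    return False

def restrict(g, i, a):
    """Sub-function g_{x_i = a}, still viewed as a function of n coordinates."""
    return lambda x: g(x[:i] + (a,) + x[i + 1:])

def lemma3_witness(f, g, n, eps):
    """Return (g', relevant variables of g') obtained by the inductive argument."""
    pts = list(product((-1, 1), repeat=n))
    C = 1 - 1 / sqrt(2)
    while True:
        rel = relevant(g, n, pts)
        k = len(rel)
        tau = C * 2 ** (-(k - 1) / 2) * eps
        bad = [i for i in rel if not has_large_coefficient(f, i, k, tau, n, pts)]
        if not bad:
            return g, rel
        i = bad[0]
        # Keep the restriction that is better correlated with f.
        g = max(restrict(g, i, 1), restrict(g, i, -1), key=lambda h: corr(f, h, pts))
        eps /= sqrt(2)

n, eps = 4, 0.5
pts = list(product((-1, 1), repeat=n))
f = lambda x: x[0]                                       # labels depend on x_0 only
g = lambda x: -1 if (x[0] == -1 and x[1] == -1) else 1   # AND of x_0, x_1 (-1 = true)
g_prime, rel = lemma3_witness(f, g, n, eps)
```

In this example the variable $x_1$ of $g$ carries no Fourier weight of $f$, so the procedure fixes it and ends with $g'(x) = x_0$, which is even better correlated with $f$ than $g$.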

One might wonder if for $f : \{-1,1\}^n \to \{-1,1\}$ and a $k$-junta $g : \{-1,1\}^n \to \{-1,1\}$, $\langle f, g \rangle \ge \epsilon$
does not imply that for every relevant variable $x_i$ of $g$, there exists $S \subseteq [n]$ with $|S| \le k$, $i \in S$,
such that (3) holds. First of all, if $f(x) = x_1 \wedge \ldots \wedge x_k$ (interpreting $-1$ as true and $+1$ as false),
then for all $S \subseteq [n]$ with $S \neq \emptyset$, $|\hat{f}(S)| \le 2^{-k+1}$. So taking $g = f$, the prior statement cannot hold.

Still, one might at least hope for a similar statement with the right-hand side of (3) replaced by
something of the form $2^{-\mathrm{poly}(k)} \cdot \mathrm{poly}(\epsilon)$. However, if we take $f$ as above and $g(x) = x_2 \wedge \ldots \wedge x_{k+1}$,
then $\langle f, g \rangle = 1 - 2^{-k+1}$ but for all $S \subseteq [n]$ with $k+1 \in S$, $\hat{f}(S) = 0$ (since $x_{k+1}$ is not relevant
to $f$).
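
Both counterexamples are easy to confirm by enumeration for a small $k$. The snippet below is only a sanity check with hypothetical helper names and 0-based coordinates, so the variable $x_{k+1}$ has index `k`.

```python
from itertools import combinations, product
from math import isclose, prod

def fourier_coefficient(f, S, n):
    """Exact coefficient <f, chi_S> by enumerating {-1,1}^n."""
    pts = product((-1, 1), repeat=n)
    return sum(f(x) * prod(x[i] for i in S) for x in pts) / 2 ** n

k, n = 3, 4
# AND with -1 read as "true": the value is -1 iff all chosen inputs are -1.
f = lambda x: -1 if all(x[i] == -1 for i in range(k)) else 1          # x_1 AND ... AND x_k
g = lambda x: -1 if all(x[i] == -1 for i in range(1, k + 1)) else 1   # x_2 AND ... AND x_{k+1}

nonempty = [S for r in range(1, n + 1) for S in combinations(range(n), r)]
largest = max(abs(fourier_coefficient(f, S, n)) for S in nonempty)
correlation = sum(f(x) * g(x) for x in product((-1, 1), repeat=n)) / 2 ** n
through_last = [fourier_coefficient(f, S, n) for S in nonempty if k in S]
```

For $k = 3$, `largest` is $2^{-k+1} = 0.25$, `correlation` is $1 - 2^{-k+1} = 0.75$, and every coefficient in `through_last` vanishes.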

Next, we need a tool for finding large low-degree Fourier coefficients of an arbitrary Boolean
function, having access to a labeled random walk. Such an algorithm is said to perform the Bounded
Sieve (see [6, Definition 3]). Bshouty et al. [6] have shown that such an algorithm exists for the
random walk model. More precisely, Theorems 7 and 9 in [6] imply:

Theorem 2 (Bounded Sieve, [6]). There is an algorithm BoundedSieve($f, \theta, \ell, \delta$) that on input
$\theta > 0$, $\ell \in [n]$, and $\delta > 0$, given access to RW($f$) for some $f : \{-1,1\}^n \to \{-1,1\}$, outputs a list of
$S \subseteq [n]$ with $\hat{f}(S)^2 \ge \theta/2$ such that with probability at least $1-\delta$, every $S \subseteq [n]$ with $|S| \le \ell$ and
$\hat{f}(S)^2 \ge \theta$ appears in it. The algorithm runs in time $\mathrm{poly}(n, 2^\ell, 1/\theta, \log(1/\delta))$, and the list contains
at most $2/\theta$ sets $S$.

For a sample $\mathcal{S} = (x^i, f(x^i))_{i=1,\ldots,m}$, a set $J \subseteq [n]$ of size $k$, and an assignment $\alpha \in \{-1,1\}^{|J|}$, let
$s_\alpha^+ = |\{i \in [m] \mid x^i|_J = \alpha \wedge f(x^i) = +1\}|$ and $s_\alpha^- = |\{i \in [m] \mid x^i|_J = \alpha \wedge f(x^i) = -1\}|$. Obviously, a
$J$-junta $h_J$ that best agrees with $f$ on the points in $\mathcal{S}$ is given by $h_J(x) = \mathrm{sgn}(s_{x|_J}^+ - s_{x|_J}^-)$. In other
words, $h_J(x)$ takes on the value $b \in \{-1,1\}$ that is taken on by the majority of labels in the sub-cube
that fixes the coordinates in $J$ to $x|_J$. This function is unique except for the choice of $h_J(x)$ at points
$x$ with $s_{x|_J}^+ = s_{x|_J}^-$. The function $h_J$ differs from the labels of $\mathcal{S}$ in $\mathrm{err}(J) = \sum_{\alpha \in \{-1,1\}^{|J|}} \mathrm{err}(\alpha)$
points, where $\mathrm{err}(\alpha) = \min\{s_\alpha^+, s_\alpha^-\}$. By Proposition 1, if $\mathcal{S}$ is sufficiently large, then with high
probability, the function $h_J$ approximately minimizes $\Delta(h, f)$ among all $J$-juntas $h$.
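
The majority-vote construction can be sketched in a few lines. The function name `best_junta_on` is hypothetical, coordinates are 0-based, and ties (as well as assignments that never occur in the sample) are broken towards $+1$, which is one of the arbitrary choices the text allows.

```python
from collections import Counter

def best_junta_on(J, sample):
    """Majority-vote hypothesis h_J and err(J), its number of disagreements on the sample."""
    votes = Counter()    # alpha -> s_alpha^+ - s_alpha^-
    counts = Counter()   # alpha -> s_alpha^+ + s_alpha^-
    for x, y in sample:
        alpha = tuple(x[j] for j in J)   # the restriction x|_J
        votes[alpha] += y
        counts[alpha] += 1
    # err(alpha) = min(s^+, s^-) = (s^+ + s^- - |s^+ - s^-|) / 2
    err = sum((counts[a] - abs(votes[a])) // 2 for a in counts)

    def h(x):
        return 1 if votes[tuple(x[j] for j in J)] >= 0 else -1

    return h, err

sample = [((1, 1, 1), 1), ((1, -1, 1), 1), ((1, 1, 1), -1),
          ((-1, 1, 1), -1), ((-1, -1, 1), -1), ((1, 1, -1), -1)]
h, err = best_junta_on((0, 2), sample)
```

On this sample the assignment $(1, 1)$ to coordinates 0 and 2 sees the labels $+1, +1, -1$, so the majority is $+1$ and one example is misclassified; the other assignments are unanimous.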

We are now ready to show our main result: 

Theorem 3 (Restatement of Theorem 1). The class of $k$-juntas $g : \{-1,1\}^n \to \{-1,1\}$ is properly
agnostically learnable with accuracy $\epsilon$ and confidence $1-\delta$ in the random walk model in time
$\mathrm{poly}(n, 2^{k^2}, (1/\epsilon)^k, \log(1/\delta))$.

Proof. In the following, we show that Algorithm 1 below is an agnostic learning algorithm with the 
desired running time bound. 






Algorithm 1 LearnJuntas

1: Input $k$, $\epsilon$, $\delta$

2: Access to RW($f$) for some $f : \{-1,1\}^n \to \{-1,1\}$

3: Run BoundedSieve($f$, $(1 - 1/\sqrt{2})^2 \cdot 2^{-k+1} \cdot \epsilon^2$, $k$, $\delta/2$) and let $T$ be the returned list.

4: Let $R = \bigcup \{S \mid S \in T\}$.

5: For all $J \subseteq R$ with $|J| = k$:
6:     Compute $\mathrm{err}(J)$.

7: Return $h_{J_{\mathrm{opt}}}$ for some $J_{\mathrm{opt}}$ that minimizes $\mathrm{err}(J)$.



Denote the class of $n$-variate $k$-juntas by $\mathcal{C}$ and let $\gamma = \mathrm{opt}_{\mathcal{C}}(f)$. We prove that, with probability
at least $1-\delta$,

$$\Delta(h_{J_{\mathrm{opt}}}, f) \le \gamma + \epsilon.$$

Let $g \in \mathcal{C}$ with $\Delta(f, g) = \gamma$, so that $\langle f, g \rangle = 1 - 2\gamma$. By Lemma 3, there exists $g' \in \mathcal{C}$ such that
$\langle f, g' \rangle \ge 1 - 2\gamma - \epsilon$ (equivalently, $\Delta(f, g') \le \gamma + \epsilon/2$) and for all relevant variables $x_i$ of $g'$, there
exists $S \subseteq [n]$ with $|S| \le k$, $i \in S$, and

$$\hat{f}(S)^2 \ge (1 - 1/\sqrt{2})^2 \cdot 2^{-k+1} \cdot \epsilon^2.$$

Consequently, with probability at least $1 - \delta/2$, the list $T$ returned in Step 3 of the algorithm
contains all of these sets $S$, and thus $R$ contains all relevant variables of $g'$. The Bounded Sieve
subroutine runs in time $\mathrm{poly}(n, 2^k, 1/\epsilon, \log(1/\delta))$.

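The set $J_{\mathrm{opt}}$ is chosen such that the corresponding $J_{\mathrm{opt}}$-junta $h_{J_{\mathrm{opt}}}$ minimizes the number of
disagreements with the labels among all $k$-juntas with relevant variables in $R$. Denote the class
of these juntas by $\mathcal{C}(R)$. Since $|T| \le 2 \cdot (1 - 1/\sqrt{2})^{-2} \cdot 2^{k-1}/\epsilon^2 \le 12 \cdot 2^k/\epsilon^2$, we have $|R| \le k|T| \le
12 \cdot k \cdot 2^k/\epsilon^2$. Consequently, $R$ contains

$$\binom{|R|}{k} \le |R|^k \le \left(12 \cdot k \cdot 2^k/\epsilon^2\right)^k$$

subsets of size $k$, and $\log|\mathcal{C}(R)| \le \log\left(2^{2^k} \cdot \binom{|R|}{k}\right) = \mathrm{poly}(2^{k^2}, (1/\epsilon)^k)$.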

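By Proposition 1, with probability at least $1 - \delta/2$,

$$\Delta(h_{J_{\mathrm{opt}}}, f) \le \mathrm{opt}_{\mathcal{C}(R)}(f) + \epsilon/2,$$

provided that $\mathrm{poly}(n, \log|\mathcal{C}(R)|, 1/\epsilon, \log(1/\delta)) = \mathrm{poly}(n, 2^{k^2}, (1/\epsilon)^k, \log(1/\delta))$ examples are drawn.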

Since $g' \in \mathcal{C}(R)$, we obtain

$$\Delta(h_{J_{\mathrm{opt}}}, f) \le \Delta(g', f) + \epsilon/2 \le \gamma + \epsilon.$$

The total running time of the algorithm is polynomial in $n$, $2^{k^2}$, $(1/\epsilon)^k$, and $\log(1/\delta)$. □
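
The overall control flow of Algorithm 1 can be sketched as follows. This is not the algorithm analyzed above: the Bounded Sieve of [6] is replaced by a hypothetical stand-in, `exhaustive_sieve`, that computes exact Fourier coefficients by enumerating the whole cube and therefore needs $f$ itself rather than a random walk. All names are assumptions, coordinates are 0-based, and the sample length `m` is picked by hand instead of via Proposition 1.

```python
import random
from itertools import combinations, product
from math import prod, sqrt

def exhaustive_sieve(f, n, theta, level):
    """Stand-in for BoundedSieve: all nonempty S with |S| <= level and f^(S)^2 >= theta."""
    pts = list(product((-1, 1), repeat=n))
    found = []
    for r in range(1, level + 1):
        for S in combinations(range(n), r):
            coeff = sum(f(x) * prod(x[i] for i in S) for x in pts) / len(pts)
            if coeff ** 2 >= theta:
                found.append(S)
    return found

def random_walk_sample(f, n, m, rng):
    """Labeled random walk of length m with a uniform starting point."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    sample = [(tuple(x), f(tuple(x)))]
    for _ in range(m - 1):
        i = rng.randrange(n)
        x[i] = -x[i]
        sample.append((tuple(x), f(tuple(x))))
    return sample

def majority_junta(J, sample):
    """Majority-vote hypothesis on J and its number of disagreements err(J)."""
    votes = {}
    for x, y in sample:
        a = tuple(x[j] for j in J)
        votes[a] = votes.get(a, 0) + y
    h = lambda x: 1 if votes.get(tuple(x[j] for j in J), 0) >= 0 else -1
    return h, sum(h(x) != y for x, y in sample)

def learn_junta(f, n, k, eps, m, seed=0):
    theta = (1 - 1 / sqrt(2)) ** 2 * 2 ** (-k + 1) * eps ** 2   # threshold of Step 3
    T = exhaustive_sieve(f, n, theta, k)
    R = sorted(set().union(*T)) if T else []                    # Step 4
    sample = random_walk_sample(f, n, m, random.Random(seed))
    candidates = list(combinations(R, min(k, len(R))))          # Step 5
    J_opt = min(candidates, key=lambda J: majority_junta(J, sample)[1])  # Steps 6-7
    return J_opt, majority_junta(J_opt, sample)[0]

n, k, eps = 6, 2, 0.2
flipped = {(1,) * n, (-1,) * n}                      # two adversarially flipped labels
f = lambda x: -x[1] * x[3] if x in flipped else x[1] * x[3]
J_opt, h = learn_junta(f, n, k, eps, m=3000)
cube = list(product((-1, 1), repeat=n))
error = sum(h(x) != f(x) for x in cube) / len(cube)
```

Here $f$ is at distance $2/64$ from the 2-junta $x_2 x_4$ (indices 1 and 3 in the code), and the sketch is expected to recover exactly that pair of variables.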
A Independence of Points in Random Walks

An updating random walk is a sequence $x^0, (x^1, i_1), (x^2, i_2), \ldots$, where $x^0$ is drawn uniformly at
random, each $i_t \in [n]$ is a coordinate drawn uniformly at random, and $x^t$ is set to $x^{t-1}$ or to
$x^{t-1} \odot e_{i_t}$, each with probability 1/2. We say that in step $t$, coordinate $i_t$ is updated.

Given an updating random walk $x^0, (x^1, i_1), (x^2, i_2), \ldots$, all variables will with high probability be
updated after $\ell = O(n \log n)$ steps, so that in this case, $x^0$ and $x^\ell$ can be considered as independent
uniformly distributed random variables. More formally, let $X_\ell$ be the set of all updating random
walks of length $\ell$, and let $X_\ell^{\mathrm{good}}$ be the set of updating random walks such that all variables have been
updated (at least once) after $\ell$ steps. Then, conditional on the updating random walk belonging to
$X_\ell^{\mathrm{good}}$, $x^0$ and $x^\ell$ are independent (and uniformly distributed).

Since the updating random walk model is only a technical utility, we would like to say similar
things about the "usual" random walk model, so that we do not have to take care of going back and
forth between the models in our analyses (although that would constitute a reasonable alternative).

We proceed as follows. Given a (non-updating) random walk $x^0, x^1, \ldots$, we perform an additional
experiment to simulate an updating random walk (see also [6]). We then accept the (original)
random walk if the additional experiment leads to a good updating random walk. It will then follow
that, conditional on the random walk being accepted, $x^0$ and $x^\ell$ are independent. Our algorithms
will of course not perform this experiment. Instead, we will reason in the analyses that if we
performed the additional experiment, then we would accept the given random walk with a certain
(high) probability (taken over the draw of the random walk and the additional experiment),
implying that certain points are independent.

Perform the following random experiment: Given a random walk $X$ of length $\ell$, draw a sequence
$F = (F_1, F_2, \ldots)$ of Bernoulli trials with $\Pr[F_j = 1] = \Pr[F_j = 0] = 1/2$ for each $j$ until $F$ contains
$\ell$ ones. (If this is not the case after, say, $L = \mathrm{poly}(\ell)$ steps, then reject $X$.) Otherwise, let $\ell'$ denote
the length of $F$ and construct a sequence $I = (i_1, \ldots, i_{\ell'})$ of variable indices as follows. Denote
by $j_1 < \cdots < j_\ell$ the $\ell$ positions in $F$ with $F_j = 1$. For each $k \in [\ell]$, let $i_{j_k} = p_k$, where $p_k$ is the
position in $X$ that is flipped in the $k$th step. For each $j \in [\ell'] \setminus \{j_1, \ldots, j_\ell\}$, independently draw an
index $i_j \in [n]$ with uniform probability. Accept $X$ if $\{i_1, \ldots, i_{\ell'}\} = [n]$, otherwise reject $X$.

Lemma 4 (Formal restatement of Lemma 1). Let $X = (x^0, \ldots, x^\ell)$ be a random walk of length
$\ell \ge n \ln(2n/\delta)$ and perform the experiment above. Then $X$ is accepted with probability at least $1-\delta$.
Moreover, conditional on $X$ being accepted, the random variables $x^0$ and $x^\ell$ are independent and
uniformly distributed.

Proof. First, by choosing $L$ appropriately, we can ensure with probability at least $1 - \delta/2$ that
$F$ contains at least $\ell$ ones. By construction, the sequence $x'^0, (x'^1, i_1), (x'^2, i_2), \ldots, (x'^{\ell'}, i_{\ell'})$ with
$x'^0 = x^0$, $x'^{j_k} = x^k$ for $k \in [\ell]$, and $x'^j = x'^{j-1}$ for $j \in [\ell'] \setminus \{j_1, \ldots, j_\ell\}$ is distributed as an
updating random walk of length $\ell'$. Note that unlike in the original updating random walk model,
we determine the sequence $F$ of updating outcomes before we determine the positions to be updated.
Moreover, the choice of the coordinates to be updated in the positions where $F_j = 1$ is incorporated
in the draw of the original walk. The subsequence $x'^0, x'^{j_1}, x'^{j_2}, \ldots, x'^{j_\ell}$ is equal to the original walk
$x^0, x^1, x^2, \ldots, x^\ell$.

The probability that $\{i_1, \ldots, i_{\ell'}\} \subsetneq [n]$ is at most

$$n \cdot (1 - 1/n)^{\ell'} \le n \cdot (1 - 1/n)^{\ell} \le \delta/2$$

since $\ell \ge n \ln(2n/\delta)$. Consequently, with total probability at least $1-\delta$, the random walk is
accepted. In this case, every coordinate has eventually been updated after the $\ell'$ steps of the
updating random walk. Thus, for each coordinate $j$, $x^\ell_j = x'^{\ell'}_j$ is independent of $x^0_j = x'^0_j$. Hence, $x^0$
and $x^\ell$ are independent and uniformly distributed (conditional on $X$ being accepted). □
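
The arithmetic behind the bound $n \cdot (1 - 1/n)^\ell \le \delta/2$ can be checked for concrete numbers. The helper name and the values of $n$ and $\delta$ below are arbitrary choices made for illustration.

```python
from math import ceil, log

def not_all_updated_bound(n, steps):
    """Union bound n * (1 - 1/n)^steps on the probability that some coordinate is never chosen."""
    return n * (1 - 1 / n) ** steps

n, delta = 50, 0.01
length = ceil(n * log(2 * n / delta))   # smallest integer with length >= n ln(2n/delta)
bound = not_all_updated_bound(n, length)
```

With these values, `length` is 461 and `bound` stays below $\delta/2 = 0.005$, as the inequality $(1 - 1/n)^\ell \le e^{-\ell/n}$ predicts.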
B An Elementary Proof of Lemma 2



To estimate the convergence rate of empirical averages to their expectations, we need the following
standard Chernoff-Hoeffding bound [10]: For a sequence of independent identically distributed
random variables $X_1, \ldots, X_m$ with $\mathrm{E}[X_i] = \mu$ that take values in $[-1,1]$,

$$\Pr\left[ \left| \frac{1}{m} \sum_{i=1}^{m} X_i - \mu \right| \ge \epsilon \right] \le 2e^{-\epsilon^2 m/2}. \qquad (4)$$



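Proof of Lemma 2. For each $j \in \{0, \ldots, N-1\}$, the points $x^{iN+j}$, $i \in \{0, \ldots, m/N - 1\}$, are with
probability at least $1 - (m/N - 1)\delta$ pairwise independent by Lemma 1. In this case, the values
$g(x^{iN+j})$, $0 \le i \le m/N - 1$, are independent and identically distributed samples of the random
variable $g(x)$ with $x \in \{-1,1\}^n$ uniformly distributed. By the Hoeffding bound,

$$\Pr\left[ \left| \frac{N}{m} \sum_{i=0}^{m/N-1} g(x^{iN+j}) - \mathrm{E}_x[g(x)] \right| > \epsilon \right] \le 2\exp(-m\epsilon^2/(2N)).$$

Thus, the probability that

$$\left| \frac{N}{m} \sum_{i=0}^{m/N-1} g(x^{iN+j}) - \mathrm{E}_x[g(x)] \right| > \epsilon \quad \text{for some } j \in \{0, \ldots, N-1\}$$

is at most $2N\exp(-m\epsilon^2/(2N))$. Finally, we have

$$\left| \frac{1}{m} \sum_{i=0}^{m-1} g(x^i) - \mathrm{E}_x[g(x)] \right| = \frac{1}{m} \left| \sum_{j=0}^{N-1} \left( \sum_{i=0}^{m/N-1} g(x^{iN+j}) - \frac{m}{N} \mathrm{E}_x[g(x)] \right) \right| \le \frac{1}{m} \cdot N \cdot \frac{m}{N} \cdot \epsilon = \epsilon$$

with probability at least $1 - 2N\exp(-m\epsilon^2/(2N)) \ge 1 - \delta$.

□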
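
The last inequality, $2N\exp(-m\epsilon^2/(2N)) \le \delta$, holds as soon as $m \ge (2N/\epsilon^2)\ln(2N/\delta)$. The sketch below evaluates this walk length for concrete parameters. The function name is hypothetical, and rounding $m$ up to a multiple of $N$ is an added assumption so that $m/N$ is an integer, as the proof implicitly uses.

```python
from math import ceil, exp, log

def walk_length(n, eps, delta):
    """Return (N, m) with N = ceil(n ln(n/delta)) and m a multiple of N
    satisfying m >= (2N / eps^2) ln(2N / delta)."""
    N = ceil(n * log(n / delta))
    m = ceil(2 * N / eps ** 2 * log(2 * N / delta))
    return N, N * ceil(m / N)   # round up so that m/N is an integer

n, eps, delta = 20, 0.1, 0.05
N, m = walk_length(n, eps, delta)
failure_bound = 2 * N * exp(-m * eps ** 2 / (2 * N))
```

For these parameters $N = 120$, and `failure_bound` does not exceed $\delta$.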